Citi Bikes: A data science challenge

In [1]:
from IPython import display ;display.Image("https://images.ctfassets.net/vz6nkkbc6q75/3avkbm7rsk9oOWAEPBy9YY/4d8d64a8613352398af4e651edd4b001/NYC-Blog-Header-837x335-1__1_.jpg",width = 900, height = 50)
Out[1]:

The report is based on the 2018 trips data from the Citi Bike website, consisting of 17,548,339 trips. First, trip volumes are examined in light of calendar and weather features; this is followed by clustering exercises on days and bike stations, and expansion opportunities are explored in potential neighborhoods. Finally, prediction problems are illustrated with results and discussion. To start, calendar features are derived from the start timestamps after import to support the analysis.

In [2]:
%run User_defined_Functions_EDA_RQ.ipynb
bike = import_data(r'Data\Trips_2018.csv'); bike = calculated_fields(bike)  # raw string avoids accidental escape sequences in the Windows path
bike[['start_time_stamp','dayofmonth', 'start_month', 'week_number','start_hour', 'start_monthname', 'start_dayofweek_name',
       'start_dayofweek', 'weekend', 'tripduration_min']].head(n=2)
(17548339, 14)
C:\Users\ANTARL~1\AppData\Local\Temp/ipykernel_2180/1002675140.py:16: FutureWarning: Series.dt.weekofyear and Series.dt.week have been deprecated.  Please use Series.dt.isocalendar().week instead.
  bike['week_number'] = bike['start_day'].dt.week # extracting week number
Out[2]:
start_time_stamp dayofmonth start_month week_number start_hour start_monthname start_dayofweek_name start_dayofweek weekend tripduration_min
start_day
2018-01-01 2018-01-01 13:50:57.434 1 1 1 13 January Monday 0 0 16.0
2018-01-01 2018-01-01 15:33:30.182 1 1 1 15 January Monday 0 0 12.0
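The FutureWarning above can be resolved by switching to the replacement pandas itself suggests; a minimal sketch on a two-row stand-in frame (the real `bike` frame is loaded by `import_data`):

```python
import pandas as pd

# Stand-in for the notebook's bike frame: only the start_day column matters here.
bike = pd.DataFrame({"start_day": pd.to_datetime(["2018-01-01", "2018-06-15"])})

# Old (deprecated): bike["start_day"].dt.week
# New: isocalendar().week returns a UInt32 series; cast to int for convenience.
bike["week_number"] = bike["start_day"].dt.isocalendar().week.astype(int)
```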

Exploratory Data Analysis:
Daily volumes, weekdays vs weekends, trip duration

In [3]:
%run User_defined_Functions_EDA_RQ.ipynb ; 
DailyPlots(bike,25,35) ; print("Time lapse on a weekday");weekdays_and_weekends_trip_duration(bike);bike_heat_map(bike)
Time lapse on a weekday
Out[3]:
Make this Notebook Trusted to load map: File -> Trust Notebook

A weekly pattern in volumes can be observed (plot 1). In the next two plots, weekday volumes peak around 7am-9am and 5pm-7pm, indicating rush hour, while weekend volumes peak between 1pm-5pm, indicating informal outings. Hour of day and day of week therefore appear to be important factors determining hourly volumes. Further, longer trip durations were observed in the middle of the year, with a gap of ~50% between January and June trip duration levels.
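The weekday/weekend contrast above reduces to a group-by over the calendar fields; a minimal sketch on made-up trips (column names assumed to match those produced by `calculated_fields`):

```python
import pandas as pd

# Toy trips: two weekday morning-rush pickups, three weekend-afternoon
# pickups, one weekday evening-rush pickup.
trips = pd.DataFrame({
    "start_hour": [8, 8, 14, 14, 14, 18],
    "weekend":    [0, 0, 1, 1, 1, 0],
})

# Count pickups per (weekend flag, hour) to expose the two daily profiles.
hourly = (trips.groupby(["weekend", "start_hour"])
               .size()
               .rename("pickups")
               .reset_index())
```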

Research Question 1: Monthly pickups and weather conditions

Daily weather data for 2018 was sourced and combined with the monthly pickups data to look for correlations between daily volumes and weather elements.

In [4]:
%run User_defined_Functions_EDA_RQ.ipynb
bike_weather = import_weather(); weather_plots(bike_weather) # contains data at trip level after joining the weather elements(which are at daily level)

Temperature has a high positive correlation with volumes, making summer conducive to biking, as opposed to snowfall and snow depth (negatively correlated). Wind has a negative but weak correlation, indicating that wind speeds are generally low. Rain (PRCP) has a positive correlation, which is counter-intuitive, albeit low; a finer-grained hourly correlation check may have revealed the expected negative relation. 5-7 day weather forecasts therefore appear to be good candidates as predictors for pickup volumes (see studies).
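The correlation signs discussed above can be reproduced with a plain `DataFrame.corr()`; a sketch on invented daily figures (NOAA-style column names `TMAX` and `SNOW` assumed, values made up):

```python
import pandas as pd

# Invented daily figures: warm days are busy, snowy days are quiet.
daily = pd.DataFrame({
    "volume": [100, 250, 400, 380, 120],
    "TMAX":   [20, 45, 80, 75, 25],       # daily max temperature (F)
    "SNOW":   [2.0, 0.0, 0.0, 0.0, 1.5],  # snowfall (inches)
})

# Pearson correlation of each weather element against pickup volume.
corr = daily.corr()["volume"]
```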

Research Question 2: Temporal Clustering on pickup volumes

Building on the observed effect of weather conditions on bike usage, an attempt was made to cluster the days of the year. Attributes such as weekday/weekend and events (holidays or weather-related) that could affect bike traffic were added alongside hourly volumes.

In [41]:
%run User_defined_Functions_EDA_RQ.ipynb
bike_volume= group_by_hour(bike); bike_volume = insert_events(bike_volume); dt = create_df_days(bike_volume)
%store dt
dt_reduced = estimate_PCA_model_short(dt, n_components=2) #this is the dataframe with the 2 dimensions obtained from PCA
 #gaussian mixture model clustering is performed and the clusters are assigned to a new dataframe
dt_reduced_2 = gaussian_mixture_clustering(dt_reduced[[0,1]], n_components=4);add_more_event_days(dt_reduced)
Stored 'dt' (DataFrame)
The average silhouette_score is: 0.48870483479362636
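The pipeline in the cell above (PCA down to 2 components, Gaussian mixture clustering, silhouette scoring) can be sketched end-to-end on synthetic day profiles; the data and names below are stand-ins, not the notebook's helpers:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two synthetic day types: 50 "workday-like" and 50 "weekend-like"
# 24-hour volume profiles with well-separated means.
X = np.vstack([rng.normal(0.0, 0.5, (50, 24)),
               rng.normal(3.0, 0.5, (50, 24))])

X2 = PCA(n_components=2).fit_transform(X)                       # 2-D embedding
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X2)
score = silhouette_score(X2, labels)                            # cluster quality
```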

Although the silhouette score obtained is lower than the one obtained with the K-means algorithm, the Gaussian mixture model partitions the data in a way that aligns better with expectations. The plot below shows a partition according to whether a day belongs to the class 'weekend or special event' or the class 'workday'. How a 'special event' was defined can be read in Appendix I - Temporal Clustering.

In [42]:
%run User_defined_Functions_EDA_RQ.ipynb
#weather data and seasons data encoded is brought into a new dataframe
fnew_2 = weather_by_day(dt_reduced) ;classifier(fnew_2);print('Show cluster sizes \n', fnew_2.cluster.value_counts());
fnew_2.head(2)
regression score 0.779
C:\Users\Public\conda\envs\geo_env\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function plot_confusion_matrix is deprecated; Function `plot_confusion_matrix` is deprecated in 1.0 and will be removed in 1.2. Use one of the class methods: ConfusionMatrixDisplay.from_predictions or ConfusionMatrixDisplay.from_estimator.
  warnings.warn(msg, category=FutureWarning)
              precision    recall  f1-score   support

           0      0.651     0.953     0.774        43
           1      0.875     0.389     0.538        36
           2      1.000     0.812     0.897        16
           3      0.900     1.000     0.947        27

    accuracy                          0.779       122
   macro avg      0.856     0.789     0.789       122
weighted avg      0.818     0.779     0.759       122

Show cluster sizes 
 0    142
1    106
3     68
2     49
Name: cluster, dtype: int64
Out[42]:
date weekend_or_event AWND PRCP SNOW TMAX season_Spring season_Summer season_Winter cluster
0 2018-01-01 1 7.83 0.0 0.0 19 0 0 1 3
1 2018-01-02 0 8.05 0.0 0.0 26 0 0 1 0

Research Question 3: Clustering stations on volumes

Hour-wise average pickup volumes for weekdays and weekends were calculated at station level and clustered using the K-means method. K=4 was chosen and cluster means were plotted.

In [62]:
%run User_defined_Functions_EDA_RQ.ipynb
combined_data = bike_clustering(bike);create_clusters(combined_data) ; n_cluster=4; combined_data_clusters,cluster_size = fitted_clusters_and_plots(combined_data,n_cluster) ; 
bike_station = visualize_clusters_on_map(combined_data_clusters,bike,cluster_size) ; combined_data.head(n=2)
Out[62]:
start_hour avg_wkdy_pickup_0 avg_wkdy_pickup_1 avg_wkdy_pickup_2 avg_wkdy_pickup_3 avg_wkdy_pickup_4 avg_wkdy_pickup_5 avg_wkdy_pickup_6 avg_wkdy_pickup_7 avg_wkdy_pickup_8 avg_wkdy_pickup_9 ... avg_wknd_pickup_14 avg_wknd_pickup_15 avg_wknd_pickup_16 avg_wknd_pickup_17 avg_wknd_pickup_18 avg_wknd_pickup_19 avg_wknd_pickup_20 avg_wknd_pickup_21 avg_wknd_pickup_22 avg_wknd_pickup_23
start_station_id
72.0 1.878049 1.670588 1.322581 1.37037 1.058824 1.345455 2.101010 6.395161 12.124514 10.468504 ... 8.435644 8.979167 9.030000 8.824742 7.310000 5.453488 4.049383 2.676471 2.825397 2.366667
79.0 1.333333 1.350000 1.080000 1.00000 1.000000 1.156250 1.578616 2.205882 5.654762 4.715447 ... 5.237113 4.979592 5.367347 4.160000 4.308511 3.421687 2.987952 2.267606 2.070175 1.700000

2 rows × 48 columns

The cluster centers differed in pickup magnitudes but not in trends (such as different peak hours across clusters). The biggest cluster (>50% of stations) has the lowest hourly pickups. The smallest cluster, with only 35 stations, has the highest volume, indicating a demand imbalance and a need for overnight bike rebalancing. Predicting pickup volumes (for rebalancing) for the respective clusters may be useful, as the cluster centers are very distinct.
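The station clustering above amounts to K-means over one 48-value profile per station (24 weekday-hour plus 24 weekend-hour averages); a sketch on randomly generated profiles with an artificial high-volume group:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Stand-in station profiles: 300 stations x 48 columns
# (24 weekday-hour averages followed by 24 weekend-hour averages).
profiles = rng.gamma(2.0, 2.0, size=(300, 48))
profiles[:35] *= 10   # a small high-volume group, echoing the finding above

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
sizes = np.bincount(km.labels_)   # stations per cluster
```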

Research Question 4: Bike stations: Expansion Opportunity

As a final exploration, the demographics of the 59 neighborhoods of NY and the density of bike stations were analysed to look for opportunities to expand the network.

In [63]:
%run User_defined_Functions_EDA_RQ.ipynb
district_level_metrics, district_level_metrics_2,neigh,neigh_bike, c_d_2  = neighborhood_demographics_import_processed(); c_d_2.to_excel('neighborhoods_demogs_bike_subway_counts.xlsx')
print("Underlying data for plots: sample"); print(c_d_2.head(n=1).T) ; chloropeth_plots(c_d_2); print("Scrollable consolidated map: ")
c_d_2.explore() # Scroll over to look at the various metrics for each neighborhood.
Underlying data for plots: sample
                                                                                 0
boro_cd                                                                        206
shape_area                                                              42664311.5
shape_leng                                                           35875.7117331
geometry                         (POLYGON ((-73.8718461029101 40.84376077785579...
Center_point                         POINT (-73.88752994402863 40.849602790434595)
lon                                                                      -73.88753
lat                                                                      40.849603
Car_Free_Commuters%                                                           87.4
Median household income (2020$)                                           157900.0
Pop_per_Sq_Mile                                                               28.5
DISTRICT                                                Park Slope/Carroll Gardens
Borough                                                                   Brooklyn
subway_count                                                                   2.0
station_count                                                                  0.0
Scrollable consolidated map: 
Out[63]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Bike station densities have positive but low correlations with population density, percentage of car-free commuters, and subway counts. Based on the plots, areas outlined in red could be candidates for network expansion, as they have low median income, existing subway connectivity, medium population density, and high dependence on public transport (motivated by this report).
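The expansion screen described above can be expressed as a simple filter over the neighborhood frame; a sketch using the column names from the `c_d_2` sample printed above, with made-up rows and illustrative thresholds (not the report's):

```python
import pandas as pd

# Made-up neighborhood rows using the column names from the c_d_2 sample.
c_d_2 = pd.DataFrame({
    "DISTRICT": ["A", "B", "C"],
    "Median household income (2020$)": [45000, 157900, 60000],
    "Car_Free_Commuters%": [82.0, 60.0, 88.0],
    "station_count": [0, 12, 1],
})

# Screen: low income, high car-free share, little or no existing coverage.
candidates = c_d_2[
    (c_d_2["Median household income (2020$)"] < 70000)
    & (c_d_2["Car_Free_Commuters%"] > 75)
    & (c_d_2["station_count"] <= 1)
]["DISTRICT"].tolist()
```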

Contributions:

Introduction                                          - Antarlina
Exploratory Analysis                                  - Antarlina
Research Q 1: Monthly pickups and weather conditions  - Antarlina
Research Q 2: Temporal clustering on pickup volumes   - Lucas
Research Q 3: Clustering stations on volumes          - Antarlina
Research Q 4: Bike stations: Expansion opportunity    - Antarlina
Volume Prediction 1: 8 weeks + 1 week                 - Lucas (50%), Antarlina (50%)
Volume Prediction 2: 10 months + 2 months             - Lucas (70%), Antarlina (30%)
Discussions & Conclusions                             - Lucas (50%), Antarlina (50%)

Discussion & Conclusions:

Based on the exploration and prediction exercises, it can reasonably be concluded that weather forecasts (RQ 1) might be very useful for predicting short spans (such as Model 1), whereas months and seasons (which summarize the weather conditions) were good predictors for Model 2. Model 1's decent accuracy of 0.86 could be improved by adding weather forecasts; its accuracy steadily declines when used to predict other months, as a model cannot predict behavior it hasn't seen and therefore generalizes poorly.
Model 2, with 10 months of data available, was better equipped to predict 2 months, but fell short because December has peculiar holiday behavior that the model had not learnt. Its accuracy of 0.58 could probably be improved by adding the previous year's data so that the model learns the holiday behavior as well. Further, to improve class balance, stratified sampling could be performed across the months so that the model can predict each behavior equally well. In both scenarios, predictions at station-cluster level (RQ 3) may have improved accuracy, with the output of the day-level clusters (RQ 2) added as a flag. Finally, bike network expansion could be considered (RQ 4) by marrying a study of neighborhood demographics with accurate volume predictions.
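The stratified-sampling idea above can be sketched with scikit-learn's `train_test_split`, holding out a test set whose month distribution matches the data (the toy frame and column names here are stand-ins):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy daily frame: 10 observations for each of the 12 months.
days = pd.DataFrame({
    "month": [m for m in range(1, 13) for _ in range(10)],
    "volume": range(120),
})

# Stratifying on month keeps every month represented in the held-out set,
# so no seasonal behavior is entirely unseen at evaluation time.
train, test = train_test_split(days, test_size=0.2,
                               stratify=days["month"], random_state=0)
```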
